[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness #4025

Merged
openshift-merge-bot[bot] merged 5 commits into openshift-kni:release-4.21 from Tal-or:4.21_cherry_pick_manual_OCPBUGS-84690
Jun 18, 2026

Conversation

@Tal-or Tal-or commented May 14, 2026

Collaborator

manual cherry-pick of #3843

Tal-or and others added 4 commits May 14, 2026 16:42
Previously, MachineConfigsState returned a single wait function for all
pools - either IsMachineConfigPoolUpdated or
IsMachineConfigPoolUpdatedAfterDeletion - chosen globally based on
whether any pool had custom SELinux policy enabled. This broke mixed
configurations where some pools use custom policy and others use the
built-in default.

Each MachineConfigObjectState now carries its own WaitForUpdated
function and pool name. The controller builds a per-pool wait map so
each pool is checked with the correct wait logic independently.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
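
The per-pool selection described in this commit can be pictured with a minimal sketch. Everything below is hypothetical and simplified: the `Pool` type, the `HasMC` field, and the helpers `isUpdated`, `isUpdatedAfterDeletion`, `statesFor`, and `pendingPools` are illustrative stand-ins, not the operator's actual code, and the real wait functions inspect MachineConfigPool objects rather than booleans.

```go
package main

import "fmt"

// Pool is a hypothetical, trimmed-down stand-in for a MachineConfigPool.
type Pool struct {
	Name     string
	Updated  bool // mirrors the UPDATED condition
	Updating bool // mirrors the UPDATING condition
	HasMC    bool // assumption: true while the operator's MachineConfig is part of the pool
}

// WaitFunc reports whether a pool reached the state the controller waits for.
type WaitFunc func(Pool) bool

// MachineConfigObjectState carries only the two fields the commit message mentions.
type MachineConfigObjectState struct {
	PoolName       string
	WaitForUpdated WaitFunc
}

// Simplified stand-ins for the two wait functions named in the commit message.
func isUpdated(p Pool) bool              { return p.Updated && !p.Updating && p.HasMC }
func isUpdatedAfterDeletion(p Pool) bool { return p.Updated && !p.Updating && !p.HasMC }

// statesFor picks the wait function per pool instead of once for all pools.
// customPolicy maps a pool name to whether custom SELinux policy is enabled.
func statesFor(poolNames []string, customPolicy map[string]bool) []MachineConfigObjectState {
	states := make([]MachineConfigObjectState, 0, len(poolNames))
	for _, name := range poolNames {
		var wait WaitFunc = isUpdatedAfterDeletion
		if customPolicy[name] {
			wait = isUpdated
		}
		states = append(states, MachineConfigObjectState{PoolName: name, WaitForUpdated: wait})
	}
	return states
}

// pendingPools returns the pools whose own wait function is not yet satisfied.
func pendingPools(pools []Pool, states []MachineConfigObjectState) []string {
	waits := make(map[string]WaitFunc, len(states))
	for _, st := range states {
		waits[st.PoolName] = st.WaitForUpdated
	}
	var pending []string
	for _, p := range pools {
		if wait, ok := waits[p.Name]; ok && !wait(p) {
			pending = append(pending, p.Name)
		}
	}
	return pending
}

func main() {
	pools := []Pool{
		{Name: "worker-custom", Updated: true, HasMC: true},
		{Name: "worker-default", Updated: true, HasMC: false},
	}
	states := statesFor(
		[]string{"worker-custom", "worker-default"},
		map[string]bool{"worker-custom": true},
	)
	fmt.Println("pending:", pendingPools(pools, states))
}
```

In this mixed configuration both pools are reported as settled. Applying `isUpdated` to every pool, as a single global choice would, leaves `worker-default` pending indefinitely in this model.
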
When an MCP has spec.paused=true, MCO will not apply pending
MachineConfig changes to its nodes. This leaves the MCP in an
UPDATING=true state indefinitely. The controller expects
UPDATING=false and UPDATED=true before proceeding, so it keeps
requeueing — leaving RTE DaemonSets in a half-baked state where
NROP never finishes configuring them.

This is especially critical during 4.16 → 4.18 upgrades: the
operator deletes the MachineConfig that provided the old SELinux
policy, which triggers MCO to roll out the change. On a paused
MCP that rollout never starts, so UPDATED stays false and the
controller requeues forever.

Skip paused MCPs so the operator can converge for all non-paused
pools and surface the paused pool names for status reporting.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
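
A minimal sketch of the skip-paused behaviour described in this commit, under stated assumptions: `Pool`, `splitPaused`, and `needsRequeue` are hypothetical names introduced for illustration, and the real controller works on MachineConfigPool objects and their conditions rather than plain booleans.

```go
package main

import "fmt"

// Pool is a hypothetical, trimmed-down stand-in for a MachineConfigPool.
type Pool struct {
	Name     string
	Paused   bool // mirrors spec.paused
	Updated  bool // mirrors the UPDATED condition
	Updating bool // mirrors the UPDATING condition
}

// splitPaused separates the pools worth waiting on from the paused ones,
// returning the paused pool names so they can be reported in status.
func splitPaused(pools []Pool) (active []Pool, paused []string) {
	for _, p := range pools {
		if p.Paused {
			paused = append(paused, p.Name)
			continue
		}
		active = append(active, p)
	}
	return active, paused
}

// needsRequeue reports whether any non-paused pool is still rolling out.
func needsRequeue(active []Pool) bool {
	for _, p := range active {
		if p.Updating || !p.Updated {
			return true
		}
	}
	return false
}

func main() {
	pools := []Pool{
		{Name: "worker", Updated: true},
		// A paused pool with a pending change stays UPDATING=true indefinitely.
		{Name: "worker-paused", Paused: true, Updating: true},
	}
	active, paused := splitPaused(pools)
	fmt.Println("requeue:", needsRequeue(active), "paused:", paused)
}
```

Without the split, the paused pool alone would keep `needsRequeue` returning true on every reconcile.
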
When an MCP is paused, MCO will not apply the custom SELinux policy
MachineConfig to its nodes. Without this policy, RTE pods get stuck
forever trying to connect to kubelet's podresources socket — blocked
by SELinux AVC denials (container_device_plugin_t denied write on
container_var_lib_t sock_file). The operator keeps reporting
Progressing/DaemonSetIsUpdating with no indication of root cause.

Surface paused MCP state as a dedicated operator condition so users
can identify the problem directly from the CR status. Backfill the
condition on upgrade from older versions that lack it.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Shereen Haj <shajmakh@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
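
The dedicated condition and its backfill could look roughly like the sketch below. The condition type name `MachineConfigPoolPaused`, the reason strings, and the helpers `pausedCondition` and `setCondition` are assumptions made for illustration; the source does not name the condition, and the operator uses Kubernetes condition types rather than this local `Condition` struct.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Condition is a hypothetical stand-in with the usual condition fields.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// condPaused is an assumed name; the real condition type may differ.
const condPaused = "MachineConfigPoolPaused"

// pausedCondition builds the condition from the list of paused pool names.
func pausedCondition(paused []string) Condition {
	if len(paused) == 0 {
		return Condition{Type: condPaused, Status: "False", Reason: "NoPausedPools"}
	}
	names := append([]string(nil), paused...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	return Condition{
		Type:    condPaused,
		Status:  "True",
		Reason:  "PoolsPaused",
		Message: "paused MachineConfigPools: " + strings.Join(names, ", "),
	}
}

// setCondition replaces the condition of the same type, or appends it when
// missing. The append path is what backfills a status written by an older
// version that never set this condition.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	// Status as an older operator version might have left it.
	status := []Condition{{Type: "Available", Status: "True"}}
	status = setCondition(status, pausedCondition([]string{"worker-b", "worker-a"}))
	for _, c := range status {
		fmt.Printf("%s=%s %s\n", c.Type, c.Status, c.Message)
	}
}
```
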
Add unit tests for MachineConfigsState covering custom/default/mixed
policies, paused pools, and edge cases. Add e2e test for mixed SELinux
policy across node groups.

Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 14, 2026
@openshift-ci-robot


@Tal-or: This pull request references Jira Issue OCPBUGS-85647, which is invalid:

  • expected dependent Jira Issue OCPBUGS-84690 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

manual cherry-pick of #3843

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 14, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@openshift-ci openshift-ci Bot requested review from mrniranjan and shajmakh May 14, 2026 14:21
@openshift-ci openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 14, 2026
@ffromani

Member

When the PR becomes final and ready for review, please make sure to remove the Co-authored-by tags and use AI-attribution tags instead (https://aiattribution.github.io/create-attribution)

@Tal-or Tal-or changed the title from "WIP: [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" to "[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" on May 18, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 18, 2026
- Replace DefaultBaseConditions with newBaseConditions (unexported in 4.21)
- Replace metahelper.FindStatusCondition with FindCondition (4.21 local helper)
- Add isNROOperSyncedAt helper using 4.21 status.FindCondition
- Fix errors import shadowing (alias k8s apierrors, keep stdlib errors)

AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the 4.21_cherry_pick_manual_OCPBUGS-84690 branch from f236a87 to d50b242 Compare May 18, 2026 06:43
@Tal-or

Tal-or commented May 18, 2026

Collaborator Author

/retest

@Tal-or

Tal-or commented Jun 18, 2026

Collaborator Author

@ffromani
We need to make sure we're completing this backport process, since we already have it in 4.18.
Current status:

  • [x] main
  • [x] 4.22
  • [ ] 4.21
  • [ ] 4.20
  • [ ] 4.19
  • [x] 4.18

@ffromani

Member

@ffromani We need to make sure we're completing this backport process, since we already have it in 4.18. current status:

* [x] main
* [x] 4.22
* [ ] 4.21
* [ ] 4.20
* [ ] 4.19
* [x] 4.18

yes, will review today or tomorrow.

@ffromani

Member

/approve
/lgtm

Let's indeed accelerate the pace with backports. Please tag me as reviewer for the remaining branches.

@openshift-ci openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) on Jun 18, 2026
@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tal-or

Tal-or commented Jun 18, 2026

Collaborator Author

/cherry-pick release-4.20, release-4.19

@openshift-cherrypick-robot


@Tal-or: once the present PR merges, I will cherry-pick it on top of release-4.20, in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.20, release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Tal-or

Tal-or commented Jun 18, 2026

Collaborator Author

/cherry-pick release-4.20 release-4.19

@openshift-cherrypick-robot


@Tal-or: once the present PR merges, I will cherry-pick it on top of release-4.20 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.20 release-4.19


@openshift-merge-bot openshift-merge-bot Bot merged commit 9c3bf0f into openshift-kni:release-4.21 Jun 18, 2026
14 checks passed
@openshift-ci-robot


@Tal-or: Jira Issue OCPBUGS-85647: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-85647 has been moved to the MODIFIED state.

Details

In response to this:

manual cherry-pick of #3843


@openshift-cherrypick-robot


@Tal-or: cannot checkout release-4.20,: error checking out "release-4.20,": exit status 1 error: pathspec 'release-4.20,' did not match any file(s) known to git

Details

In response to this:

/cherry-pick release-4.20, release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
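
The error above shows the robot trying to check out a branch literally named `release-4.20,`. The sketch below illustrates how whitespace-only tokenization of the comment would produce that name. It is not the robot's code: `naiveTargets` and `tolerantTargets` are hypothetical helpers, and how the robot actually parses its arguments is an assumption here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// naiveTargets splits the comment on whitespace only, so a trailing comma
// stays attached to the branch name.
func naiveTargets(comment string) []string {
	fields := strings.Fields(comment)
	if len(fields) < 2 || fields[0] != "/cherry-pick" {
		return nil
	}
	return fields[1:]
}

// tolerantTargets treats commas as separators as well as whitespace.
func tolerantTargets(comment string) []string {
	fields := strings.FieldsFunc(comment, func(r rune) bool {
		return r == ',' || unicode.IsSpace(r)
	})
	if len(fields) < 2 || fields[0] != "/cherry-pick" {
		return nil
	}
	return fields[1:]
}

func main() {
	comment := "/cherry-pick release-4.20, release-4.19"
	fmt.Printf("%q\n", naiveTargets(comment))
	fmt.Printf("%q\n", tolerantTargets(comment))
}
```

With whitespace-only splitting the first target comes out as `release-4.20,`, which matches the pathspec in the error. Writing the command without commas avoids the problem regardless of how the robot tokenizes it.
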

@openshift-cherrypick-robot


@Tal-or: #4025 failed to apply on top of branch "release-4.20":

Applying: objectstate/rte: per-pool MachineConfig wait function
Using index info to reconstruct a base tree...
M	internal/controller/numaresourcesoperator_controller.go
M	internal/controller/numaresourcesoperator_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging internal/controller/numaresourcesoperator_controller.go
Auto-merging internal/controller/numaresourcesoperator_controller_test.go
CONFLICT (content): Merge conflict in internal/controller/numaresourcesoperator_controller_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 objectstate/rte: per-pool MachineConfig wait function

Details

In response to this:

/cherry-pick release-4.20 release-4.19


@Tal-or

Tal-or commented Jun 21, 2026

Collaborator Author

/jira cherry-pick OCPBUGS-85647

@openshift-ci-robot


@Tal-or: Jira Issue OCPBUGS-85647 has been cloned as Jira Issue OCPBUGS-90544. Will retitle bug to link to clone.
/retitle OCPBUGS-90544: [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness

Details

In response to this:

/jira cherry-pick OCPBUGS-85647


@openshift-ci openshift-ci Bot changed the title from "[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" to "OCPBUGS-90544: [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" on Jun 21, 2026
@openshift-ci-robot


@Tal-or: Jira Issue OCPBUGS-90544: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-90544 has been moved to the MODIFIED state.

Details

In response to this:

manual cherry-pick of #3843


@Tal-or

Tal-or commented Jun 21, 2026

Collaborator Author

/retitle [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness

@openshift-ci openshift-ci Bot changed the title from "OCPBUGS-90544: [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" to "[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" on Jun 21, 2026
@openshift-ci-robot


@Tal-or: Jira Issue OCPBUGS-85647 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

Details

In response to this:

manual cherry-pick of #3843



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm: Indicates that a PR is ready to be merged.

4 participants